Some basic information about the data:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity     volatile.acidity   citric.acid       residual.sugar
##  Min.   : 3.800    Min.   :0.0800     Min.   :0.0000    Min.   : 0.600
##  1st Qu.: 6.300    1st Qu.:0.2100     1st Qu.:0.2700    1st Qu.: 1.700
##  Median : 6.800    Median :0.2600     Median :0.3200    Median : 5.200
##  Mean   : 6.855    Mean   :0.2782     Mean   :0.3342    Mean   : 6.391
##  3rd Qu.: 7.300    3rd Qu.:0.3200     3rd Qu.:0.3900    3rd Qu.: 9.900
##  Max.   :14.200    Max.   :1.1000     Max.   :1.6600    Max.   :65.800
##  chlorides         free.sulfur.dioxide   total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00        Min.   :  9.0
##  1st Qu.:0.03600   1st Qu.: 23.00        1st Qu.:108.0
##  Median :0.04300   Median : 34.00        Median :134.0
##  Mean   :0.04577   Mean   : 35.31        Mean   :138.4
##  3rd Qu.:0.05000   3rd Qu.: 46.00        3rd Qu.:167.0
##  Max.   :0.34600   Max.   :289.00        Max.   :440.0
##  density           pH               sulphates         alcohol
##  Min.   :0.9871    Min.   :2.720    Min.   :0.2200    Min.   : 8.00
##  1st Qu.:0.9917    1st Qu.:3.090    1st Qu.:0.4100    1st Qu.: 9.50
##  Median :0.9937    Median :3.180    Median :0.4700    Median :10.40
##  Mean   :0.9940    Mean   :3.188    Mean   :0.4898    Mean   :10.51
##  3rd Qu.:0.9961    3rd Qu.:3.280    3rd Qu.:0.5500    3rd Qu.:11.40
##  Max.   :1.0390    Max.   :3.820    Max.   :1.0800    Max.   :14.20
##  quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.878
##  3rd Qu.:6.000
##  Max.   :9.000
The dataset contains 4898 observations of 12 variables. Quality is stored as an integer; all other variables are numeric (floating-point) values. The mean quality is 5.878 and the median quality is 6. The highest quality score is 9 and the lowest is 3.
Fixed acidity ranges from 3.8 to 14.2, with a median of 6.8. About 75% of the wines have a volatile acidity below 0.32 and a citric acid content below 0.39. For both free and total sulfur dioxide, the gap between the minimum and maximum values is large (2.0 vs. 289.0 for free sulfur dioxide and 9.0 vs. 440.0 for total sulfur dioxide). Density varies much less, with a minimum of 0.9871, a maximum of 1.0390 and a median of 0.9937. pH values range from 2.72 to 3.82. Alcohol content varies between 8 and 14.2, with a median of 10.4.
First, I look at the distribution of quality ranks.
## [1] 4898
##
##    3    4    5    6    7    8    9
##   20  163 1457 2198  880  175    5
There are 4898 wines in the data set. Tabulating the values shows that both tails are thinly populated: only 20 wines have the lowest quality (3) and only 5 have the highest (9). By far the most common score is 6, given to 2198 of the 4898 wines. A histogram of the distribution suggests that quality is roughly normally distributed.
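The counts in the table are enough to reproduce the headline numbers for quality. The following is a minimal sketch that rebuilds one quality score per wine from the printed table; the names `counts` and `quality` are my own and not taken from the original analysis code.

```r
# Counts per quality level, copied from the table above
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Rebuild one quality score per wine from the counts
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)                      # number of wines
round(mean(quality), 3)              # mean quality
median(quality)                      # median quality
round(100 * prop.table(counts), 1)   # share of wines per level, in percent
```

Looking at the shares in percent makes it easier to see how thin the two tails are compared with the middle of the scale.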
Next, I get a sense of the distribution of the other variables.
Many of the variables appear to be roughly normally distributed (although the bin widths have not yet been adjusted). I then adjust the bin widths very roughly for each plot, based on the scale of the x axis.
After adjusting the bin widths, I’m intrigued by “residual sugar” and “alcohol”, which do not seem to be normally distributed. A few of the variables also seem to have very long tails.
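One rough way to pick a bin width from the scale of the x axis is to divide the observed range by a target number of bins. The helper below is a hypothetical sketch: the name `rough_binwidth` and the choice of 30 bins are assumptions, not what was used for the plots in this report. The ranges are the minimum and maximum values from the summary output above.

```r
# Hypothetical helper: spread the observed range over a fixed number of bins
rough_binwidth <- function(x, bins = 30) {
  diff(range(x, na.rm = TRUE)) / bins
}

# Min and max per variable, copied from the summary output above
ranges <- list(
  residual.sugar = c(0.6, 65.8),
  alcohol        = c(8.0, 14.2),
  pH             = c(2.72, 3.82)
)

# One rough bin width per variable
binwidths <- sapply(ranges, rough_binwidth)
signif(binwidths, 2)
```

If the histograms are drawn with ggplot2, a value like this could be passed as the `binwidth` argument of `geom_histogram()` and then rounded to something easier to read.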
Looking more closely at the distribution of residual sugar.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.600   1.700   5.200   6.391   9.900  65.800
Residual sugar content differs considerably between wines. The mean is 6.391 and the median 5.2. The minimum is as low as 0.6, while the maximum is 65.8. The 1st quartile is 1.7 and the 3rd quartile is 9.9. Plotting the distribution reveals that most wines have a residual sugar content below 20, with a spike between 1 and 2.
There appear to be some outliers to the far right of the plot, so I make a new plot where I zoom in to get a closer look. It reveals that only five wines have a residual sugar value above 25, and only three are above 30 (two with 31.60 and one with 65.80). All of these wines have a judged quality of 6, which is the most common quality level, so they don’t stand out quality-wise.
## [1] 31.60 31.60 65.80 26.05 26.05
## [1] 6 6 6 6 6
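The two outputs above can be reproduced with a simple threshold filter. This is a sketch on a small illustrative frame, not the full data: `wine_sample` is a hypothetical name, built from the first ten residual sugar values in the `str()` output plus the five outlying values printed above.

```r
# Small illustrative frame: first ten residual sugar values from the str()
# output plus the five outliers printed above (all with quality 6)
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5,
                     31.60, 31.60, 65.80, 26.05, 26.05),
  quality = rep(6L, 15)
)

# Keep only the wines above the threshold of 25
outliers <- subset(wine_sample, residual.sugar > 25)

outliers$residual.sugar             # residual sugar of the outlying wines
outliers$quality                    # their quality scores
sum(outliers$residual.sugar > 30)   # how many of them exceed 30
```

On the full data set the same `subset()` call would be applied to the complete data frame instead of `wine_sample`.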
Looking more closely at the distribution of alcohol content. The minimum value is 8, and the maximum value 14.2. The median and mean values are 10.4 and 10.51 respectively.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    8.00    9.50   10.40   10.51   11.40   14.20
I’m intrigued by the “spikes”, and add breaks to the x axis to see where they occur.
I wonder if the small “spikes” in the distribution coincide with round numbers (they may, for example, be caused by producers reporting rounded rather than exact values). I create a modulo function to calculate the ending decimal of each value in order to get an impression of whether or not the numbers are rounded. With this function I create a new variable, the ending decimal of the alcohol content of each wine. I then create a histogram of the distribution of the ending decimals.
In this data set, alcohol values seem to be stated in increments of 0.1. As shown in the plot above, I would argue that there is a higher frequency of wines with an alcohol content corresponding to “round numbers”, with an ending decimal of 0 or 5. There are for instance 650 wines with a stated alcohol content ending in 0, compared to 410 ending in 1 and 387 ending in 9. My guess is that some producers round the stated alcohol value to a round number.
In this case, such possible inaccuracies may not be of much significance, but each one has the potential to slightly affect all other analysis done on the data set (for example fitting linear models).
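The ending-decimal calculation can be sketched as follows. The function name `ending_decimal` is hypothetical and this is not necessarily the modulo function used for the plot; it assumes values stated to one decimal, so any value with more decimals would simply be assigned to its nearest tenth. The input is the first ten alcohol values from the `str()` output.

```r
# Hypothetical helper: last decimal digit of a value stated to one decimal.
# round() guards against floating-point noise (x * 10 is not guaranteed to
# be an exact whole number) before the remainder is taken.
ending_decimal <- function(x) {
  round(x * 10) %% 10
}

# First ten alcohol values from the str() output above
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

ending_decimal(alcohol)          # ending decimal per wine
table(ending_decimal(alcohol))   # frequency of each ending decimal
```

Rounding before taking the remainder is a deliberate choice: without it, values that differ from a whole number only by floating-point error could be counted under the wrong digit.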
I will look more closely into the relation between alcohol content and other variables in the bivariate and multivariate analysis.
The dataset contains 4898 observations with 12 features. Quality is an integer value, but apart from that all other features are numeric (float) values.
Mean quality is 5.878 and median quality is 6. Although the quality scale runs from 0 to 10, the highest observed quality is 9 and the lowest is 3.
Quality is the main feature of interest in this dataset.
I’m open-minded as to which other features will support my investigation into quality. I have no particular knowledge of wine chemistry, and at the beginning of the investigation I do not have any intuition as to which variables correlate with higher quality rankings.
I created a variable called alcohol.ending.decimal, which is the ending decimal of the stated alcohol content (alcohol content is stated in increments of 0.1 %). I used the variable to plot the distribution of ending decimals to see if there was a higher occurrence of “rounded” numbers with regard to alcohol content, which I believe there is. Since I did not plan on using the variable any further, I dropped it from the data set after conducting the analysis.
As mentioned above in connection with the univariate plots, most of the variables seemed to have roughly normal distributions, albeit with very long tails. Residual sugar and alcohol did not seem to be normally distributed.
I have not performed any further operations to tidy or rearrange the data. The data seems relatively tidy, with each variable as a column and each observation as a row. The only change I made was to remove the X column, since R already numbers each observation (row).
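A column named X typically appears when a CSV file stores row numbers in an unnamed first column. The sketch below reproduces that situation on a toy frame (three rows taken from the `str()` output) and then drops the column; the names `toy`, `raw` and `wine` are assumptions, and the temporary file stands in for the real data file.

```r
# Toy frame standing in for the wine data (values from the str() output above)
toy <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# Writing with row names produces an unnamed first column in the file
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = TRUE)

# read.csv() gives that unnamed column the name "X"
raw <- read.csv(path)
names(raw)

# Drop the redundant row-number column
wine <- raw[, names(raw) != "X"]
names(wine)
```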
I started my bivariate analysis by using ggpairs to get an overview of how the different variables relate to each other.
From the plot, I note that alcohol and density seem to have some degree of correlation with the other variables in the data set, but that other than that there does not seem to be much correlation between the variables.
In order to investigate which chemical properties are correlated with higher quality, I decide to group the wines by quality, using the summarise function to compute mean and median values for each “quality group”.
My main interest is how the other features of the data set relate to quality, so I focus on identifying the features with the clearest relation to it.
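The grouping step can be sketched in base R, using `tapply()` rather than the `summarise` function so that it runs without extra packages. The alcohol values in `toy` are made up purely for illustration and are not taken from the wine data; only the shape of the computation carries over.

```r
# Toy data: made-up alcohol values for three quality levels (illustration only)
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.0, 9.5, 10.0, 10.0, 10.5, 11.6, 11.0, 11.5, 12.9)
)

# One row per quality group with mean, median and group size
by_quality <- data.frame(
  quality        = sort(unique(toy$quality)),
  mean_alcohol   = as.vector(tapply(toy$alcohol, toy$quality, mean)),
  median_alcohol = as.vector(tapply(toy$alcohol, toy$quality, median)),
  n              = as.vector(table(toy$quality))
)

by_quality
```

Keeping the group size `n` next to the mean and median is useful here, because the extreme quality levels contain very few wines and their group statistics are correspondingly less reliable.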
From running ggpairs to produce a scatter matrix, I recall that alcohol did have the highest correlation with quality, and I want to look into this in closer detail:
Alcohol mean values by quality:
Alcohol median values by quality:
Density also looks promising with regard to correlation and merits a closer look:
Plotting mean total.sulfur.dioxide content by quality.
Plotting median total.sulfur.dioxide content by quality.
Since the mean plots do not show variance, I decided to look at the relation between the different features and quality with boxplots, which give an indication of the distribution of each variable at each quality level. I therefore plot boxplots of all variables against quality by using grid.arrange:
I want to look further into the relation between alcohol and quality. The below plot shows alcohol level by quality.
There seems to be a tendency for lower quality wines to have lower alcohol content and better quality wines to have higher alcohol content. That being said, there seems to be quite a bit of variance; for example, the lower quality wines seem to vary considerably with regard to alcohol content.
As stated in the univariate plots section, I started my analysis of which variables were important for wine quality with an open mind. I therefore decided to plot all the variables in boxplots with quality on the x axis. For several of the features, there seems to be a polynomial/quadratic relation between quality and the feature. This is the case with alcohol, for example, where the highest and lowest quality wines have higher alcohol content, and the medium-low quality wines have lower alcohol content on average.
There seem to be relations between alcohol and some other variables. In particular, there seems to be a relation between alcohol and quality (the feature of interest). The correlation is 0.4355747.
## [1] 0.4355747
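For reference, the Pearson correlation reported here can be written out from its definition and compared with R's built-in `cor()`. The sketch uses the first ten rows of alcohol and residual sugar from the `str()` output (quality cannot be used, since those ten rows all have quality 6); ten rows are far too few to reproduce the correlations reported for the full data set. The function name `pearson` is my own.

```r
# Pearson correlation written out from its definition
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# First ten rows from the str() output above (illustration only)
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

manual  <- pearson(alcohol, residual.sugar)
builtin <- cor(alcohol, residual.sugar)   # method = "pearson" is the default

all.equal(manual, builtin)                # the two computations agree
```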
This trend can also be shown in a density plot:
Alcohol content, however, also seems to be related to other features, such as density (correlation of -0.7801376)…
## [1] -0.7801376
… and total.sulfur.dioxide (correlation of -0.4488921).
## [1] -0.4488921
The strongest relationship I found was the relationship between density and residual sugar. The correlation here is 0.8389665.
## [1] 0.8389665
[Plot: Density by residual sugar]
Recalling that alcohol and density seemed to have a degree of correlation, I want to see how this relates to quality by adding color for quality:
From the plot above it seems that wines of higher quality are typically higher in alcohol and lower in density.
The plot below shows the relation between total.sulfur.dioxide and density. From running ggpairs, I know they have one of the strongest correlations between the variables. By adding color for quality, I want to see if there is some relation to quality:
It appears higher quality wines have lower density and lower total.sulfur.dioxide.
There also appears to be some correlation between density and residual.sugar level, and I want to see how this relates to quality:
It appears that higher quality wines tend to have less density, and less residual sugar.
I also want to investigate the relation between pH values and quality a bit further:
Except for the very highest and the very lowest quality wines, mean pH seems to be relatively similar across quality groups.
However, the shape of the distribution seems to be slightly different, which is more visible if I use facet wrap to create a separate pH density plot for each level of quality:
Very low quality wines seem to vary a lot with regard to pH values, whereas the highest quality wines tend to have pH values that are more clustered together. I wonder whether there is some kind of relation here, or whether it is simply a result of there being few observations at the extreme ends of the quality spectrum.
Total.sulfur.dioxide and free.sulfur.dioxide seem to have some degree of correlation (0.615501), and I want to examine this in closer detail, in particular how this relates to quality:
## [1] 0.615501
It appears that higher quality wines have less total.sulfur.dioxide and more free.sulfur.dioxide. I’m at a loss as to why this might be the case, but a quick Google query reveals that sulfur dioxide (SO2) protects wine from oxidation and bacteria. However, too much of it can impact taste.
From this research I understand that free and total sulfur dioxide levels are related. This leaves me curious as to whether the proportion of free to total sulfur dioxide has an impact on quality.
I decide to create a new variable free.sulfur.dioxide.proportion which is free.sulfur.dioxide/total.sulfur.dioxide, and plot the results by quality:
## [1] -0.1747372
## [1] 0.008158067
## [1] 0.1972141
The proportion of free.sulfur.dioxide to total.sulfur.dioxide is indeed slightly more strongly correlated with quality than the other two values printed above (-0.1747372 and 0.008158067, compared in absolute terms), but with a correlation of only 0.1972141 it is not a strong predictor of quality.
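Creating the proportion variable is a single column-wise division. The sketch below applies it to the first ten rows of the two sulfur dioxide columns from the `str()` output; `wine_head` and `wine` are assumed names, and the correlation call is left as a comment because those ten rows all share the same quality score.

```r
# First ten rows of the two sulfur dioxide columns, from the str() output
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Share of the total sulfur dioxide that is free
wine_head$free.sulfur.dioxide.proportion <-
  wine_head$free.sulfur.dioxide / wine_head$total.sulfur.dioxide

round(wine_head$free.sulfur.dioxide.proportion, 3)

# On the full data, the correlation with quality would be a call such as
# cor(wine$free.sulfur.dioxide.proportion, wine$quality)  # `wine` is an assumed name
```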
Since alcohol is the feature which in itself has the strongest relation with quality, I want to investigate the relation between alcohol and the free sulfur dioxide proportion and their relation to quality by adding color for quality:
From the above plot, it appears that higher quality wines have a higher level of alcohol, as well as a higher proportion of free sulfur dioxide.
In the multivariate analysis, I looked at the relation between alcohol and density, which seem to strengthen each other when looking at quality: higher quality wines are typically higher in alcohol and lower in density. Density and residual sugar also seemed to strengthen each other, with higher quality wines tending to have lower density and less residual sugar. The same is true for alcohol and the proportion of free sulfur dioxide: higher quality wines have a higher level of alcohol as well as a higher proportion of free sulfur dioxide.
I found it interesting that the distribution of pH values seemed to be so different across different quality levels. Given the low number of observations at the extreme ends of the quality spectrum, however, it is hard to say whether this is a result of a genuine difference between high and low quality wines, or whether it is particular just to this sample of white wines.
N/A
This plot shows density distributions of alcohol content for each quality level. As can be seen from the plot, wines of lower quality tend to have a lower alcohol content, while higher quality wines tend to have a higher alcohol content. Because the extreme quality levels have few observations, this trend is most apparent for the quality levels with many observations (4 to 8). For wines with a quality of 9, the distribution appears to be bimodal. There are, however, only 5 observations at this level. The spike just after 10 % is due to a single wine ranked 9 with an alcohol percentage of 10.4. The other wines ranked 9 have an alcohol percentage between 12.4 and 12.9. For the other quality levels, outliers such as these do not affect the plot to the same degree.
This plot shows the distribution of pH values for each quality level. I have added a vertical line at the mean pH value for all white wines to make comparisons across the quality levels easier (since the mean is 3.188267 and the median is 3.18, I did not feel the need to add both). The distribution of pH values seems to vary with the quality level. The lowest quality wines are distributed more evenly across different pH values, ranging from about 2.7 to 3.7. High quality wines, on the other hand, seem to be distributed across a narrower pH range, from about 3.15 to 3.45. Further, the lower quality wines (especially 4 and 5) seem to have more pH values below the mean, while higher quality wines (8 and 9) have more pH values above the mean. This might suggest that more acidic/sour wines are judged to taste less good.
I have created a new feature, the proportion of free sulfur dioxide to total sulfur dioxide. The plot is a scatter plot with the proportion of free sulfur dioxide on the x axis and alcohol content on the y axis. Color indicates quality level: lower quality wines are purple, medium quality wines green and high quality wines yellow. From the plot, it appears that higher quality wines have a higher level of alcohol as well as a higher proportion of free sulfur dioxide, and are therefore more likely to be in the upper right part of the plot. Similarly, lower quality wines tend to have a lower alcohol content and a lower proportion of free sulfur dioxide, and are therefore more likely to be in the lower left part of the plot.
The white wine data set contains information on 4898 white wine variants of the Portuguese “Vinho Verde” wine. My overall goal with the analysis was to uncover a relation between the different features and wine quality. From my analysis of the different features of the dataset, it appears that there is a connection between some of the features and wine quality. Alcohol level in particular appears to be correlated with higher quality wines. However, even though there is some relation between the different features and quality, it was not as pronounced as the strong, linear relation between price and carat in the diamonds dataset.
I would say it was a bit disappointing not to uncover a stronger relationship. On the other hand, it would be surprising if something as complex as the subjective taste of wine could be broken down into 12 chemical properties. There are likely interactions between the chemical properties that all work together to produce the subjective experience of the wine. Some of the analysis might also be influenced by the fact that there are very few observations at the extreme ends of quality. For example, there are only 20 observations of wines judged to be of quality 3 and only 5 of wines judged to be of the highest quality, 9.
The data set only covers wines from one region in Portugal. It would be interesting to investigate whether the findings would be different if wines from a different region or a range of regions were used. The data also seems to be limited to one year. It would be interesting to see year-on-year changes, particularly as one often hears wine producers talk about “good years” and “bad years”, and to see whether the chemical properties of the wine change from a good year to a bad year.